ABSTRACT
The paper discusses the current state of Sindhi corpus construction in detail. Sindhi corpus
development issues, including corpus acquisition, preprocessing, and tokenization, are discussed.
Preliminary results and observations are presented, covering letter unigram, bigram and trigram
frequencies, word frequencies, and word bigram frequencies. The current state of the Sindhi
corpus, its limitations, and future work are also discussed. The paper also explores the
orthography and script of the Sindhi language with reference to corpus development.

Keywords: corpus construction, unigram, bigram, trigram frequencies, orthography, script


Introduction 


Sindhi is one of the major languages of Pakistan, spoken by approximately 30-40 million people
(Sindhi Language Authority, 2009; Collie, J., 2009). Sindhi is frequently used on the internet:
Sindhi blogs, literary websites, online newspapers and discussion forums are increasing day by
day. After Urdu, Sindhi is the second largest written language of Pakistan. Despite its online
usage and popularity, only a few language processing resources are available to NLP researchers,
namely lexicons, fonts and simple word processors. The development of Sindhi language
processing resources such as linguistic corpora and a comprehensive computational lexicon has
not even been initiated. Sindhi is written in Persio-Arabic (سنڌي), Devanagari (सिन्धी) and
Roman (sindhi) scripts. Persio-Arabic is the most common script for Sindhi writing in Pakistan
and India. Devanagari script is also used for Sindhi writing in India. Roman script (though not
yet standardized) is also gaining popularity. Very few written documents are available in Roman
script, but it is frequently used for communication on the internet, cell phones and other smart
devices. Because most online and offline written Sindhi material is available in Persio-Arabic
script, the Sindhi corpus under construction uses Persio-Arabic script with UTF-16 encoding.

The following sections discuss existing work on Pakistani language corpora, the orthography
and script of the Sindhi language, corpus construction issues, corpus acquisition,
preprocessing, tokenization and the results of a preliminary statistical analysis. Finally,
future work is discussed along with the conclusion.
Previous work


Apart from fonts, keyboard design (Bhurgri, 2010a) and a few digital dictionaries (CRULP,
2010a), Sindhi language processing resources are not publicly available. Studies or
development projects for resources such as linguistic corpora and a comprehensive computational
lexicon have not even been initiated. Various research organizations and individuals are working
on linguistic corpora for other Pakistani languages. For Urdu, EMILLE (McEnery et al., 2000),
the Becker-Riaz corpus (Becker & Riaz, 2002), the Jang newspaper corpus (Hussain, 2008), and
the parallel English, Urdu and Nepali corpus (CRULP, 2010b) are some key examples. For Pashto,
the projects include the BBN Byblos Pashto OCR System (Decerbo et al., 2004) and the
machine-readable Pashto text corpus being developed at the University of Peshawar
(Khan & Zuhra, 2007). The first Punjabi language corpus was developed by the Central Institute
of Indian Languages (CIIL), India (Lehal, 2009). The Hindi and Punjabi parallel corpus developed
by CDAC Noida is another useful linguistic corpus. No comparable linguistic corpora exist for
Sindhi, Balochi, Siraiki and many other Pakistani languages. In contrast to other Pakistani
languages (excluding Urdu), Sindhi text in electronic format is easily available and is being
continuously collected for the corpus under discussion.


Orthography and script of Sindhi language 


Sindhi is written in Persio-Arabic script, based on an extended Arabic character set, in Naskh
style. The Sindhi alphabet comprises 52 letters, shown in Figure 1. The alphabet contains basic
letters as well as secondary letters such as جھ and گھ, which are aspirated versions of ج and گ.


Figure 1. Sindhi alphabet


Sindhi words always end in a vowel (Rahman, 2009); this vocalic ending is optionally marked
by diacritics in written text. Diacritics are also used inside words to represent additional
vocalic features. The absence of diacritics in written text sometimes causes semantic ambiguity.
For instance, the words meaning "to push" and "bog" are written identically without diacritics
and are therefore ambiguous. The diacritics used in Sindhi are shown in Figure 2.





Figure 2. Diacritics used in Sindhi. 


Sindhi has its own numerals, based on Persio-Arabic numerals, shown in Figure 3. Hindu-Arabic
numerals are also very common in Sindhi writing. The special symbols shown in Figure 3 are
also used in Sindhi written text.





Figure 3. Special symbols and numerals used in Sindhi written text.


Sindhi corpus development


Since the arrival of Unicode support and a Unicode-based Sindhi keyboard design (Bhurgri, 2010b),
the availability of Unicode-based Sindhi text on the internet has been increasing day by day.
The key motivation for Sindhi corpus construction is the availability of online text in Sindhi
newspapers, blogs, literary websites and discussion forums. Although the available online
resources do not yet provide a huge amount of text, they are growing steadily and the corpus
is being collected continuously. Software routines for preprocessing, normalization,
tokenization and frequency calculation are implemented in C# using the Microsoft .NET
Framework libraries.
Corpus acquisition


Data is gathered from various domains which include news, blogs, literature, essays, and letters. 
Different subdomains include current affairs, sports, showbiz, short stories, discussions and 
opinions. Sources of data collection are shown in Table 1. 


Table 1. Sources of data collection 


Source               URL(s)
Daily Kawish         http://www.thekawish.com
Daily Awami Awaz     http://www.awamiawaz.com
Daily Ibrat          http://dailyibrat.com
Blogs                http://shikarpuri.wordpress.com
Literary Writings    http://voiceofsindh.net, http://sindhsalamat.com


Preprocessing and normalization 


Almost all of the gathered data was already in Unicode, but all collected text is nevertheless
converted to standard UTF-16 encoding. Letters that can be represented by multiple Unicode code
points, and equivalent composed and decomposed forms (Hussain & Durrani, 2008), are reduced to
the same underlying form. Aspirated letters such as جھ and گھ, which are combinations of two
Unicode characters (for instance ج and ھ in the case of جھ), are treated as single letters
during text processing.
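
The normalization step described above can be sketched as follows. This is a minimal Python illustration, not the authors' C# routines. It assumes that Unicode NFC is an acceptable common form for composed and decomposed equivalents, and that the two-code-point aspirated letters are ج or گ followed by ھ (U+06BE). The names `normalize`, `split_letters` and `DIGRAPHS` are hypothetical.

```python
import unicodedata

# Assumption: the aspirated letters stored as two code points are
# JEEM + HEH DOACHASHMEE and GAF + HEH DOACHASHMEE.
DIGRAPHS = ("\u062C\u06BE", "\u06AF\u06BE")


def normalize(text):
    """Reduce composed and decomposed equivalents to one form (NFC)."""
    return unicodedata.normalize("NFC", text)


def split_letters(text):
    """Split text into letters, keeping each digraph as a single letter."""
    text = normalize(text)
    letters = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            letters.append(pair)  # two code points, counted as one letter
            i += 2
        else:
            letters.append(text[i])
            i += 1
    return letters
```

When the normalized text is written to disk, `normalize(text).encode("utf-16")` would give a UTF-16 byte sequence, in line with the encoding chosen for the corpus.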


Tokenization 


For tokenization, white space, punctuation marks, special symbols (such as $, %, #) and digits
are used as word boundaries. Treating white space as a word boundary caused the embedded-space
problem, in which a single word containing an internal space is split into two tokens; this was
handled with the same technique used for Urdu (Ijaz & Hussain, 2007). Another problem in Sindhi
word tokenization occurs when the two special words ۾ (in) and ۽ (and) are written without
surrounding spaces, as in the sequence transliterated me: mila:i a, which was tokenized as a
single word. Similarly, in kita:ba ain qalama (book and pen) three words written without spaces
were tokenized as a single word. The same problem was observed with words ending in a
non-connecting letter, as in kʰi:ra pi:a (drink milk), or starting with one, as in sindʰa ander
(in Sindh). A semiautomatic (software-based plus manual) approach was used to overcome this
problem.
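
The boundary rules above can be illustrated with a short Python sketch. It is an assumption-laden approximation rather than the authors' implementation: letters and combining marks (Unicode categories L and M) are treated as word characters, everything else (white space, punctuation, symbols, digits) as a boundary, and the special words for "in" and "and" are assumed to be the single code points U+06FE and U+06FD, which are always emitted as separate tokens. The name `tokenize` is hypothetical, and the embedded-space and non-connecting-letter cases are not handled here, since the paper resolves those semiautomatically.

```python
import unicodedata

# Assumption: the special words "in" and "and" are single code points.
SPECIAL_WORDS = {"\u06FE", "\u06FD"}


def tokenize(text):
    """Split text into word tokens using the boundary rules described above."""
    tokens = []
    current = []

    def flush():
        # Emit the word collected so far, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch in SPECIAL_WORDS:
            flush()
            tokens.append(ch)  # special word is always its own token
        elif unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)  # letters and diacritics stay inside the word
        else:
            flush()  # white space, punctuation, symbols and digits end a word
    flush()
    return tokens
```

With this rule, a sequence such as "book and pen" written without spaces is split into three tokens, because the word for "and" is recognized even when it is attached to its neighbours.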


Results and observations 


A corpus of 4.1 million words was analyzed quantitatively. This preliminary analysis includes
letter frequency analysis, letter bigram analysis, letter trigram analysis, word frequency
analysis, and word bigram analysis. These quantitative results are discussed in the following
sections.


Letter frequencies 


A total of 13,968,112 characters in the corpus were analyzed when calculating letter frequencies.
Along with the 52 letters of the Sindhi alphabet, one additional character was also treated as a
single letter, because it is a single key on the Sindhi keyboard and has a single Unicode
representation. The most frequently occurring letter was a vowel, while the least frequently
occurring letter was a consonant. Table 2 shows the 20 most frequent letters in the Sindhi corpus
with their percentages. It was also observed that the frequency distribution of individual
letters in a single file of 50,000 or more words closely matched the letter frequency
distribution of the whole corpus. This similarity can be seen in the graphs of Figures 4 and 5.

Letter bigram and trigram frequencies were also analyzed. Almost 50% of the 20 most frequent
bigrams are valid two-letter words. The same holds for trigrams, where the ratio is more than
60%. The percentages of the 20 most frequent bigrams and trigrams are shown in Tables 3 and 4
respectively.
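
A letter n-gram count of this kind can be sketched in Python as below. The paper does not say whether n-grams were counted across word boundaries, so this sketch assumes they are counted inside each word only, and it treats every code point as one letter for simplicity. The function name `letter_ngram_percentages` is hypothetical.

```python
from collections import Counter


def letter_ngram_percentages(words, n, top=20):
    """Return the `top` most frequent letter n-grams with their percentages.

    Assumption: n-grams are taken inside each word and never span two words.
    """
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(gram, 100.0 * c / total) for gram, c in counts.most_common(top)]
```

Calling the function with n = 1, 2 and 3 on the tokenized corpus would give figures comparable in form to Tables 2, 3 and 4.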


Table 2. Top 20 most frequent letters. 


S.No.  Letter  Percent    S.No.  Letter  Percent
1      US      13.77%     11     S       3.25%
2      |       11.42%     12     h       3.23%
3      S       8.99%      13     A       2.50%
4              7.84%      14     8       2.00%
5              6.27%      15     2       1.80%
6      C»      6.15%      16     |       1.18%
7      e       3.73%      17     Ö       1.16%
8      c       3.64%      18     S       1.16%
9      J       3.30%      19     E       0.99%
10     Sur     3.26%      20     c       0.94%


Table 3. Top 20 bigrams in Sindhi corpus 


S.No.  Bigram  Percent    S.No.  Bigram  Percent
1      L       3.16%      11     és      1.18%
2      c       2.55%      12     is      1.10%
3      UY      1.95%      13     |       1.10%
4      és      1.80%      14     d       1.07%
5      T       1.79%      15     |       1.02%
6      m       1.79%      16     y       1.01%
7      és      1.79%      17     -$3     0.99%
8      ud      1.69%      18     is      0.97%
9      A       1.28%      19     Y       0.95%
10     3       1.27%      20     cu      0.93%


Figure 4. Letter frequency distribution in Sindhi corpus


Figure 5. Letter frequency distribution in a single file


Word frequencies 


A total of 4.1 million words were analyzed and 70,576 distinct word forms were found. The most
frequently occurring words include case markers and auxiliary/incomplete verbs. A postposition
has the highest frequency of occurrence, as shown in Table 5. Word bigram occurrences were also
calculated and are shown in Table 6. The proper-name bigram for Benazir Bhutto is among the top
10 bigrams. This is because the current affairs domain contains essays and newspaper columns
about the life of former Prime Minister Benazir Bhutto.
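
The word-level counts can be sketched in the same way. The following Python fragment is an illustration only: it assumes the corpus is already available as a flat list of tokens, counts bigrams over adjacent tokens without regard to sentence boundaries (the paper does not specify this), and uses the hypothetical name `word_stats`.

```python
from collections import Counter


def word_stats(tokens, top=20):
    """Return distinct-word count, top words and top word bigrams with percentages."""
    unigrams = Counter(tokens)
    # Assumption: bigrams are adjacent token pairs, ignoring sentence boundaries.
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    top_words = [(w, 100.0 * c / n_uni) for w, c in unigrams.most_common(top)]
    top_bigrams = [(b, 100.0 * c / n_bi) for b, c in bigrams.most_common(top)]
    return len(unigrams), top_words, top_bigrams
```

As a rough check on the reported figures, 70,576 distinct forms in about 4.1 million tokens corresponds to a type/token ratio of roughly 1.7%.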
Table 4. Top 20 letter trigrams in Sindhi Corpus


S.No.  Trigram  Percent    S.No.  Trigram  Percent
1      Q        1.40%      11     ode      0.45%
2      Gos      1.34%      12     E        0.44%
3      m        0.81%      13     Ss       0.44%
4      bes      0.74%      14     fe       0.42%
5      £u       0.71%      15     ~        0.41%
6      Sis      0.61%      16     G        0.40%
7      ja       0.60%      17     c        0.36%
8      OXs      0.53%      18     à        0.35%
9      Sloe     0.47%      19     wit      0.35%
10     Au       0.46%      20     shes     0.35%


Table 5. Top 20 most frequent words in Sindhi corpus 


S.No.  Word  Percent    S.No.  Word  Percent
1      ca    3.71%      11     we    0.69%
2      e     2.44%      12     e     0.69%
3      ç     2.17%      13     L     0.67%
4      c     1.78%      14     Aus   0.63%
5      Q     1.61%      15     MT    0.57%
6      EY    1.61%      16     [E    0.55%
7      c     1.50%      17     Ye    0.51%
8      2j    1.05%      18     és    0.50%
9      c     0.82%      19     ae    0.50%
10     OO    0.71%      20     Sos   0.46%


Table 6. Top 10 most frequent word bigrams 


S.No.  Word bigram  Percentage
1      Dgo          7.52
2      a            6.75
3      e            2.66
4      2 da s       1.93
5      eiue         1.84
6      al           1.72
7      Sek          1.60
8      e ts         1.60
9      S23 5        1.44
10     (s           1.21





Future work 


The corpus is being collected continuously and the results are being updated. Currently the
corpus is simply a UTF-16 encoded text collection. Studies are in progress on proper annotation,
POS tagging, corpus-based lexicon development and n-gram based text categorization.

The Sindhi tokenization algorithm needs further work to address the problems discussed in the
Tokenization section (4.3). Because Sindhi has no standard sentence-terminating punctuation
mark, the full stop, comma and other punctuation marks are all used as sentence terminators in
Sindhi writing. Sentence segmentation is therefore another key area to be worked out. More
specific Sindhi computational linguistic studies are needed for the further development and
maturity of the corpus. For example, there is currently no comprehensive POS tagging algorithm
for Sindhi; the presently available POS tagging algorithm (Mahar & Memon, 2010) needs to be
analyzed and extended further. A Sindhi tag set needs to be designed before the corpus can be
POS tagged. Qualitative and quantitative improvements, proper annotation and comprehensive
statistical analysis are areas that require extensive work.


Conclusion 


In the absence of language processing resources for the Sindhi language, the Sindhi corpus
construction project is a valuable initiative. Regardless of its size and preliminary results,
the corpus in its current state will provide a basis for further natural language processing
studies of Sindhi. Letter frequencies, including bigram and trigram frequencies, provide a basis
for intelligent text processing and compact keyboard design for cell phones and other smart
devices. Word-level unigram and bigram frequencies provide a basis for spelling correction and
automatic sentence completion applications. Further development of the corpus will be useful for
advanced language processing tasks such as morphological analysis, syntactic analysis, semantic
analysis, information retrieval and extraction, and machine translation.